Abstract



This paper establishes a nearly optimal algorithm for estimating the frequencies and amplitudes
of a mixture of sinusoids from noisy equispaced samples. We derive our algorithm by viewing line
spectral estimation as a sparse recovery problem with a continuous, infinite dictionary. We show
how to compute the estimator via semidefinite programming and provide guarantees on its
mean-square error rate. We derive a complementary minimax lower bound on this estimation rate,
demonstrating that our approach nearly achieves the best possible estimation error. Furthermore,
we establish bounds on how well our estimator localizes the frequencies in the signal, showing
that the localization error tends to zero as the number of samples grows. We verify our
theoretical results in an array of numerical experiments, demonstrating that the semidefinite
programming approach outperforms two classical spectral estimation techniques.

Keywords: Approximate support recovery, Atomic norm, Compressive sensing, Infinite dictionary, 
Line spectral estimation, Minimax rate, Sparsity, Stable recovery, Superresolution 

1 Introduction 

Spectrum estimation is one of the fundamental problems in statistical signal processing. Despite
hundreds of years of research on this subject, several fundamental open questions remain in this
area. This paper addresses a central one of these problems: how well can we determine the
locations and magnitudes of spectral lines from noisy temporal samples? In this paper, we
establish lower bounds on how well we can recover such signals and demonstrate that these
worst-case bounds can be nearly saturated by solving a convex programming problem. Moreover, we
prove that the estimator approximately localizes the frequencies of the true spectral lines.

We consider signals whose spectra consist of spike trains with unknown locations in a normalized
interval $\mathbb{T} = [0, 1]$. Consider $n = 2m + 1$ equispaced samples of a mixture of sinusoids given by

\[
  x^\star_j = \sum_{l=1}^{k} c_l \exp(i 2\pi j f_l) \tag{1.1}
\]

where $j \in \{-m, \ldots, m\}$. We assume that the support $T = \{f_l\}_{l=1}^{k} \subset \mathbb{T}$ of the $k$ frequencies and
the corresponding complex amplitudes $\{c_l\}_{l=1}^{k}$ are unknown. We observe noisy samples $y = x^\star + w$
where the noise components $w_j$ are i.i.d. circularly symmetric complex Gaussian variables with
variance $\sigma^2$. By swapping the roles of frequency and time or space, the signal model (1.1) also
serves as a proper model for superresolution imaging where we aim to localize temporal events
or spatial targets from noisy, low-frequency measurements [1, 2]. Our first result characterizes the
denoising error $\frac{1}{n}\|\hat{x} - x^\star\|_2^2$ and is summarized in the following theorem.



Theorem 1. Suppose the line spectral signal $x^\star$ is given by (1.1) and we observe $n$ noisy consecutive
samples $y_j = x^\star_j + w_j$ where $w_j$ is i.i.d. complex Gaussian with variance $\sigma^2$. If the frequencies $\{f_l\}_{l=1}^{k}$
in $x^\star$ satisfy a minimum separation condition

\[
  \min_{p \neq q} d(f_p, f_q) > 4/n \tag{1.2}
\]

with $d(\cdot,\cdot)$ the distance metric on the torus, then we can determine an estimator $\hat{x}$ satisfying

\[
  \frac{1}{n}\|\hat{x} - x^\star\|_2^2 = O\!\left(\sigma^2 \frac{k \log(n)}{n}\right) \tag{1.3}
\]

with high probability by solving a semidefinite programming problem.
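
The separation condition (1.2) is easy to check numerically. The following is a minimal sketch, assuming that $d(\cdot,\cdot)$ is the usual wrap-around distance on $[0, 1)$; the function names are illustrative and do not come from the paper.

```python
def torus_distance(f1, f2):
    """Wrap-around distance between two frequencies on the unit torus [0, 1)."""
    d = abs(f1 - f2) % 1.0
    return min(d, 1.0 - d)


def min_separation(freqs):
    """Smallest pairwise wrap-around distance among the given frequencies."""
    return min(
        torus_distance(freqs[p], freqs[q])
        for p in range(len(freqs))
        for q in range(p + 1, len(freqs))
    )


def satisfies_separation(freqs, n):
    """True when condition (1.2) holds: every pair is more than 4/n apart."""
    return len(freqs) < 2 or min_separation(freqs) > 4.0 / n


n = 65
well_separated = [0.05, 0.30, 0.95]   # 0.05 and 0.95 are 0.10 apart on the torus
too_close = [0.10, 0.10 + 2.0 / n]    # only 2/n apart
print(satisfies_separation(well_separated, n))
print(satisfies_separation(too_close, n))
```

Note that the wrap-around matters: the pair (0.05, 0.95) is 0.10 apart on the torus, not 0.90.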

Note that if we exactly knew the frequencies $f_j$, the best rate of estimation we could achieve
would be $O(\sigma^2 k/n)$ [3]. Our upper bound is merely a logarithmic factor larger than this rate. On
the other hand, we will demonstrate via minimax theory that a logarithmic factor is unavoidable
when the support is unknown. Hence, our estimator is nearly minimax optimal.

It is instructive to compare our stability rate to the optimal rate achievable for estimating a
sparse signal from a finite, discrete dictionary [4]. In the case that there are $p$ incoherent dictionary
elements, no method can estimate a $k$-sparse signal from $n$ measurements corrupted by Gaussian
noise at a rate less than $O\!\left(\sigma^2 \frac{k \log(p/k)}{n}\right)$. In our problem, there are an infinite number of candidate
dictionary elements and it is surprising that we can still achieve such a fast rate of convergence with
our highly coherent dictionary. We emphasize that none of the standard techniques from sparse
approximation can be immediately generalized to our case. Not only is our dictionary infinite, but
it also does not satisfy the usual assumptions such as restricted eigenvalue conditions [5] or coherence
conditions [6] that are used to derive stability results in sparse approximation. Nonetheless, in
terms of mean-square error performance, our results match those obtained when the frequencies
are restricted to lie on a discrete grid.

In the absence of noise, polynomial interpolation can exactly recover a line spectral signal of
$k$ arbitrary frequencies with as few as $2k$ equispaced measurements. In light of our minimum
frequency separation requirement (1.2), why should one favor convex techniques for line spectral
estimation? Our stability result coupled with minimax optimality establishes that no method can
perform better than convex methods when the frequencies are well separated. And, while polynomial
interpolation and subspace methods do not impose any resolution-limiting assumptions on the
constituent frequencies, these methods are empirically highly sensitive to noise. To the best of our
knowledge, there is no result similar to Theorem 1 that provides finite sample guarantees about
the noise robustness of polynomial interpolation techniques.

Additionally, little is known about how well spectral lines can be localized from noisy observations.
The frequencies estimated by any method will never exactly coincide with the true frequencies
in the signal in the presence of noise. However, we can characterize the localization performance
of our convex programming approach, and summarize this performance in Theorem 2.

Before stating the theorem, we introduce a bit of notation. Define neighborhoods $N_j$ around
each frequency $f_j$ in $x^\star$ by $N_j := \{f \in \mathbb{T} : d(f, f_j) < 0.16/n\}$. Also define $F = \mathbb{T} \setminus \cup_{j=1}^{k} N_j$ as the set
of frequencies in $\mathbb{T}$ which are not near any true frequency. The letters $N$ and $F$ denote the regions
that are near to and far from the true supporting frequencies. The following theorem summarizes
our localization guarantees.

Theorem 2. Let $\hat{x}$ be the solution to the same semidefinite programming (SDP) problem as referenced
in Theorem 1 and $n > 256$. Let $\hat{c}_l$ and $\hat{f}_l$ form the decomposition of $\hat{x}$ into coefficients and
frequencies, as revealed by the SDP. Then, there exist fixed numerical constants $C_1$, $C_2$ and $C_3$ such
that with high probability

iv.) If for any frequency $f_j$, the corresponding amplitude $|c_j| > C_1 \sigma \sqrt{\frac{k^2 \log(n)}{n}}$, then with high
probability there exists a corresponding frequency $\hat{f}_j$ in the recovered signal such that,

Part (i) of Theorem 2 shows that the estimated amplitudes corresponding to frequencies far
from the support are small. In practice, we note that we rarely find any spurious frequencies in
the far region, suggesting that our bound (i) is conservative. Parts (ii) and (iii) of the theorem
show that in a neighborhood of each true frequency, the recovered signal has amplitude close to
the true signal. Part (iv) shows that the larger a particular coefficient is, the better our method is
able to estimate the corresponding frequency. In particular, note that if $|c_j| > 2 C_1 \sigma \sqrt{\frac{k^2 \log(n)}{n}}$, then
$|f_j - \hat{f}_j| \le \ldots$ In all four parts, note that the localization error goes to zero as the number
of samples grows.

We proceed as follows. In Section 2, we begin by contextualizing our result in the canon of
line spectral estimation. We emphasize the advantages and shortcomings of prior art, and describe
the methods on which our analysis is built. We then describe in Section 3 the semidefinite
programming approach to line spectral estimation, originally introduced in [7], and explain how
it relates to other recent spectrum estimation algorithms. We present minimax lower bounds for
line spectral estimation in Section 4. We then provide the proofs of our main results in Section 5.
Finally, in Section 6, we empirically demonstrate that the semidefinite programming approach
outperforms MUSIC [8] and Cadzow's technique [9] in terms of the localization metrics defined by
parts (i), (ii) and (iii) of Theorem 2.

2 Prior Art in Line Spectral Estimation 

To date, line spectral analysis may be broadly classified into two camps. Subspace methods [8]-[11]
build upon polynomial interpolation [12] and exploit certain low rank structure in the spectrum
estimation problem for denoising. Research on subspace approaches has yielded several standard
algorithms that are widely deployed and shown to achieve the Cramér-Rao bound asymptotically
[13, 14]. However, the sensitivity to noise and model order is not well understood, and there are few
guarantees of how these algorithms perform given a limited number of noisy measurements. For a
review of many of these classical approaches, see for example [15].

More recently, approaches based on convex optimization have gained favor and have been
demonstrated to perform well on a variety of spectrum estimation tasks [16]-[19]. These convex
programming methods restrict the frequencies to lie on a finite grid of points and view line spectral
signals as a sparse combination of single frequencies. While these methods are reported to have
significantly better localization properties than subspace methods (see, for example, [16]) and admit
fast and robust algorithms, they have two significant drawbacks. First, while finer gridding may
lead to better performance, very fine grids are often numerically unstable. Second, traditional
compressed sensing theory does not adequately characterize the performance of fine gridding
in these algorithms as the dictionary becomes highly coherent.

Some very recent work [1, 2, 7] bridges the gap between the performant discretized algorithms
and continuous subspace approaches by developing a new theory of convex relaxations for an infinite,
continuous dictionary of frequencies. Our work in [7] applies the atomic norm framework proposed
by Chandrasekaran et al. [20] to the line spectral estimation problem. There, we established stability
results on the denoising error and demonstrated empirically that our algorithm compared favorably
with both the classical and recent convex approaches which assume the frequencies are on an
oversampled DFT grid. Our prior results made no assumption about the separation between
frequencies. When the frequencies are well separated, the current work demonstrates that much
faster convergence rates are achieved.

Our work is closely related to recent results established by Candès and Fernandez-Granda [1]
on exact recovery using convex methods and their recent work [2] on exploiting the robustness of
their dual polynomial construction to show super-resolution properties of convex methods. The
total variation norm formulation used in [2] is equivalent to the atomic norm specialized to the line
spectral estimation problem.

Robustness bounds were established in both our earlier work [7] and in the work of Candès and
Fernandez-Granda [2]. In [7], a slow convergence rate was established with no assumptions about
the separation of frequencies in the true signal. In [2], the authors provide guarantees on the $L_1$
energy of error in the frequency domain in the case that the frequencies are well separated. The
noise is assumed to be adversarial with a small $L_1$ spectral energy. In contrast, our paper shows
near minimax denoising error under Gaussian noise. It is also not clear that there is a computable
formulation for the optimization problem analyzed in [2]. While the guarantees the authors derive
in [2] are not comparable with our results, several of their mathematical constructions are used in
our proofs here.

Additional recent work derives conditions for approximate support recovery under the Gaussian
noise model using the Beurling-Lasso [21]. There, the authors show that there is a true frequency
in the neighborhood of every estimated frequency with large enough amplitude. We note that the
Beurling-Lasso is equivalent to the atomic norm algorithm that we analyze in this paper. A more
recent paper by Fernandez-Granda [22] improves this result by giving conditions on recoverability
in terms of the true signal instead of the estimated signal and proves a theorem similar to Theorem 2,
but uses a worst-case $L_2$ bound on the noise samples. Here, we improve these recent results in our
proof of Theorem 2, providing tighter guarantees under the Gaussian noise model.

3 Frequency Localization using Atomic Norms

We describe our signal model more precisely in this section. Suppose we wish to estimate the
amplitudes and frequencies of a signal $x(t)$, $t \in \mathbb{R}$, given as a mixture of $k$ complex sinusoids:

\[
  x(t) = \sum_{l=1}^{k} c_l \exp(i 2\pi f_l t)
\]

where $\{c_l\}_{l=1}^{k}$ are unknown complex amplitudes corresponding to the $k$ unknown frequencies $\{f_l\}_{l=1}^{k}$
assumed to be in the torus $\mathbb{T} = [0, 1]$. Such a signal may be thought of as a normalized band-limited
signal and has a Fourier transform given by a line spectrum:

\[
  \mu(f) = \sum_{l=1}^{k} c_l \delta(f - f_l) \tag{3.1}
\]



Denote by $x^\star$ the $n = 2m + 1$ dimensional vector composed of equispaced Nyquist samples
$\{x(j)\}_{j=-m}^{m}$.

The goal of line spectral estimation is to estimate the frequencies and amplitudes of the signal
$x(t)$ from the finite, noisy samples $y \in \mathbb{C}^n$ given by

\[
  y_j = x^\star_j + w_j
\]

for $-m \le j \le m$, where $w_j \sim \mathcal{CN}(0, \sigma^2)$ is i.i.d. circularly symmetric complex Gaussian noise.

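
The observation model above can be simulated directly. The following is a minimal sketch using only the Python standard library; the function names, the example frequencies and amplitudes, and the noise level are illustrative assumptions, not values from the paper.

```python
import cmath
import math
import random


def line_spectral_samples(freqs, amps, m):
    """Noise-free samples x_j = sum_l c_l exp(i 2 pi j f_l) for j = -m..m."""
    return [
        sum(c * cmath.exp(2j * math.pi * j * f) for f, c in zip(freqs, amps))
        for j in range(-m, m + 1)
    ]


def add_complex_gaussian_noise(x, sigma, rng):
    """Add i.i.d. circularly symmetric CN(0, sigma^2) noise to each sample."""
    s = sigma / math.sqrt(2.0)  # real and imaginary parts each carry sigma^2 / 2
    return [xj + complex(rng.gauss(0.0, s), rng.gauss(0.0, s)) for xj in x]


rng = random.Random(0)          # fixed seed so the sketch is repeatable
m = 16                          # n = 2m + 1 = 33 samples
freqs = [0.10, 0.40]            # hypothetical frequencies on the torus
amps = [1.0 + 0.0j, 0.5j]       # hypothetical complex amplitudes
x_star = line_spectral_samples(freqs, amps, m)
y = add_complex_gaussian_noise(x_star, sigma=0.1, rng=rng)
print(len(y))
```

The sample at index `m` corresponds to $j = 0$, where every sinusoid equals one, so `x_star[m]` is simply the sum of the amplitudes.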
3.1 Algorithm: Atomic Norm Soft Thresholding (AST) 

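We can model the line spectral observations $x^\star = [x^\star_{-m}, \ldots, x^\star_{m}]^T \in \mathbb{C}^n$ as a sparse combination
of "atoms" $a(f)$ which correspond to observations due to single frequencies. Define the vector
$a(f) \in \mathbb{C}^n$ for any $f \in \mathbb{T} = [0, 1]$ by

\[
  a(f) = \left[ e^{i 2\pi (-m) f}, \ldots, 1, \ldots, e^{i 2\pi m f} \right]^T.
\]

Then, we rewrite model (1.1) as follows:

\[
  x^\star = \sum_{l=1}^{k} c_l a(f_l) = \sum_{l=1}^{k} |c_l| \, a(f_l) e^{i\phi_l} \tag{3.2}
\]

where $e^{i\phi_l} = c_l/|c_l|$ is the phase of the $l$th component. So, the target signal $x^\star$ may be viewed as a
sparse non-negative combination of elements from the atomic set $\mathcal{A}$ given by

\[
  \mathcal{A} = \left\{ a(f) e^{i\phi} : f \in [0, 1],\ \phi \in [0, 2\pi] \right\}. \tag{3.3}
\]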

For a general atomic set $\mathcal{A}$, the atomic norm of a vector is defined as the gauge function associated
with the convex hull $\mathrm{conv}(\mathcal{A})$ of atoms:

\[
  \|z\|_{\mathcal{A}} = \inf\{ t > 0 : z \in t\,\mathrm{conv}(\mathcal{A}) \}
  = \inf\Big\{ \sum_{a} c_a : z = \sum_{a} c_a a,\ a \in \mathcal{A},\ c_a \ge 0 \Big\}. \tag{3.4}
\]



The authors in [20] justify the use of the atomic norm as a general penalty function to promote
sparsity in an infinite dictionary $\mathcal{A}$. This generalizes various forms of sparsity. For example, the $\ell_1$
norm [23] for sparse vectors is an atomic norm corresponding to the atomic set formed by canonical
unit vectors. The nuclear norm [24] for low rank matrices is an atomic norm induced by the atomic
set of unit-norm rank-1 matrices.

In this paper, we analyze the performance of the atomic norm soft thresholding (AST) estimate:

\[
  \hat{x} = \arg\min_{z} \ \frac{1}{2}\|y - z\|_2^2 + \tau \|z\|_{\mathcal{A}} \tag{3.5}
\]



where the atomic norm $\|\cdot\|_{\mathcal{A}}$ corresponds to the atomic set in (3.3), and $\tau$ is a suitably chosen
regularization parameter. The corresponding dual problem is interesting because it gives a way of
localizing the frequencies in an atomic norm achieving decomposition of $\hat{x}$. The dual problem of
AST is given by the following semi-infinite program:

\[
  \begin{aligned}
  \underset{q}{\text{maximize}} \quad & \frac{1}{2}\|y\|_2^2 - \frac{1}{2}\|y - \tau q\|_2^2 \\
  \text{subject to} \quad & \sup_{f \in \mathbb{T}} |\langle q, a(f) \rangle| \le 1
  \end{aligned} \tag{3.6}
\]

It is convenient to associate a trigonometric polynomial $\hat{Q}(f) = \langle \hat{q}, a(f) \rangle$ with the optimal solution
$\hat{q}$ of the dual problem. As discussed in [7], the frequencies in the support of the solution $\hat{x}$ can be
identified by finding points on the torus $\mathbb{T}$ where $\hat{Q}$ has a magnitude of unity. We use

\[
  \hat{x} = \sum_{l} \hat{c}_l \, a(\hat{f}_l) \tag{3.7}
\]

to denote the decomposition of $\hat{x}$ given by the dual polynomial $\hat{Q}(f)$.
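
The localization step can be illustrated on a toy dual vector. The sketch below is an assumption-laden illustration rather than the paper's solver: it takes the inner product convention $\langle q, a(f) \rangle = \sum_j q_j \overline{a_j(f)}$, evaluates $|Q(f)|$ on a uniform grid (the grid size and tolerance are arbitrary choices), and uses the hypothetical vector $q = a(f_0)/n$, for which $Q$ is a normalized Dirichlet kernel that reaches magnitude one only at $f_0$.

```python
import cmath
import math


def atom(f, m):
    """Atom a(f) with entries exp(i 2 pi j f), j = -m..m."""
    return [cmath.exp(2j * math.pi * j * f) for j in range(-m, m + 1)]


def dual_polynomial(q, f, m):
    """Q(f) = <q, a(f)>, taken here as sum_j q_j * conj(a_j(f)) (assumed convention)."""
    return sum(qj * aj.conjugate() for qj, aj in zip(q, atom(f, m)))


def localize(q, m, grid_size=1024, tol=1e-6):
    """Grid points where |Q| is within tol of 1 and is a local maximum."""
    grid = [g / grid_size for g in range(grid_size)]
    mag = [abs(dual_polynomial(q, f, m)) for f in grid]
    peaks = []
    for g in range(grid_size):
        left, right = mag[g - 1], mag[(g + 1) % grid_size]  # wrap around the torus
        if mag[g] >= 1.0 - tol and mag[g] >= left and mag[g] >= right:
            peaks.append(grid[g])
    return peaks


m = 8
n = 2 * m + 1
f0 = 0.25                                  # hypothetical single true frequency
q = [aj / n for aj in atom(f0, m)]         # toy dual vector, not an AST solution
print(localize(q, m))
```

A grid search only finds frequencies up to the grid resolution; a root-finding step on $1 - |Q(f)|^2$ would be needed for off-grid precision.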

We show in [7] that a good choice of $\tau$ for obtaining accelerated convergence rates is

\[
  \tau = \eta \sigma \sqrt{n \log(n)} \tag{3.8}
\]

for some $\eta \in (1, \infty)$. We shall use this choice of regularization parameter throughout this paper.
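
As a small illustration, the rule (3.8) can be written as a one-line helper. The function name and the default `eta=1.5` are hypothetical; the text only requires $\eta > 1$.

```python
import math


def ast_regularization(sigma, n, eta=1.5):
    """tau = eta * sigma * sqrt(n log n), as in (3.8); eta must exceed 1."""
    if eta <= 1.0:
        raise ValueError("eta must lie in (1, infinity)")
    return eta * sigma * math.sqrt(n * math.log(n))


print(ast_regularization(sigma=1.0, n=64, eta=2.0))
```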



Remark. As shown in Section III.A of our prior work [7], problem (3.5) is equivalent to the
semidefinite programming problem

\[
  \underset{z,\,u,\,t}{\text{minimize}} \quad \frac{1}{2}\|y - z\|_2^2 + \frac{\tau}{2}(t + u_1) \tag{3.9}
\]

\[
  \text{subject to} \quad \begin{bmatrix} \mathrm{Toep}(u) & z \\ z^* & t \end{bmatrix} \succeq 0, \tag{3.10}
\]

where $\mathrm{Toep}(u)$ denotes a Hermitian Toeplitz matrix with $u$ as its first row and $u_1$ is the first
component of $u$. Similarly, the dual semi-infinite program (3.6) is equivalent to the dual semidefinite
program of (3.9).

4 What is the best rate we can expect?



Using results about minimax achievable rates for linear models [4, 25], we can deduce that the
convergence rate stated in (1.3) is near optimal. Define the set of $k$ well-separated frequencies as

\[
  S_k = \left\{ (f_1, \ldots, f_k) \in \mathbb{T}^k \ \middle|\ d(f_p, f_q) \ge 4/n,\ p \neq q \right\}.
\]

The expected minimax denoising error $M_k$ for a line spectral signal with frequencies from $S_k$ is
defined as the lowest expected denoising error rate for any estimate $\hat{x}(y)$ for the worst-case signal
$x^\star$ with support $T(x^\star) \in S_k$. Note that we can lower bound $M_k$ by restricting the set of candidate
frequencies to a smaller set. To that end, suppose we restrict the signal $x^\star$ to have frequencies only
drawn from an equispaced grid on the torus $\mathbb{T}_n := \{4j/n\}_{j=1}^{n/4}$. Note that any set of $k$ frequencies
from $\mathbb{T}_n$ are pairwise separated by at least $4/n$. If we denote by $F_n$ an $n \times (n/4)$ partial DFT matrix
with (unnormalized) columns corresponding to frequencies from $\mathbb{T}_n$, we can write $x^\star = F_n c^\star$ for
some $c^\star$ with $\|c^\star\|_0 = k$. Thus,

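\begin{align*}
M_k &:= \inf_{\hat{x}} \sup_{T(x^\star) \in S_k} \frac{1}{n} E\|\hat{x} - x^\star\|_2^2 \\
    &\ge \inf_{\hat{x}} \sup_{\|c^\star\|_0 \le k} \frac{1}{n} E\|\hat{x} - F_n c^\star\|_2^2 \\
    &\ge \inf_{\hat{c}} \sup_{\|c^\star\|_0 \le k} \frac{1}{n} E\|F_n(\hat{c} - c^\star)\|_2^2 \\
    &\ge \frac{n}{4} \left\{ \inf_{\hat{c}} \sup_{\|c^\star\|_0 \le k} \frac{4}{n} E\|\hat{c} - c^\star\|_2^2 \right\}.
\end{align*}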

Here, the first inequality is the restriction of $T(x^\star)$. The second inequality follows because we
project out all components of $\hat{x}$ that do not lie in the span of $F_n$. Such projections can only reduce
the Euclidean norm. The third inequality uses the fact that the minimum singular value of $F_n$ is
$\sqrt{n}$ since $F_n^* F_n = n I_{n/4}$. Now we may directly apply the lower bound for estimation error for linear
models derived by Candès and Davenport. Namely, Theorem 1 of [4] states that

\[
  \inf_{\hat{c}} \sup_{\|c^\star\|_0 \le k} \frac{4}{n} E\|\hat{c} - c^\star\|_2^2 \ \ge\ C \sigma^2 \frac{k \log\left(\frac{n}{4k}\right)}{\|F_n\|_F^2}.
\]

With the preceding analysis and the fact that $\|F_n\|_F^2 = n^2/4$, we can thus deduce the following
theorem:



Theorem 3. Let $x^\star$ be a line spectral signal as described by (1.1) with the support $T(x^\star) =
\{f_1, \ldots, f_k\} \in S_k$ and $y = x^\star + w$, where $w \in \mathbb{C}^n$ is circularly symmetric Gaussian noise with
variance $\sigma^2 I_n$. Let $\hat{x}$ be any estimate of $x^\star$ using $y$. Then,

\[
  M_k = \inf_{\hat{x}} \sup_{T(x^\star) \in S_k} \frac{1}{n} E\|\hat{x} - x^\star\|_2^2 \ \ge\ C \sigma^2 \frac{k \log\left(\frac{n}{4k}\right)}{n}
\]

for some constant $C$ that is independent of $k$, $n$, and $\sigma$.

This theorem and Theorem 1 certify that AST is nearly minimax optimal for spectral estimation
of well-separated frequencies.
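
The two facts about $F_n$ used in the argument above, $F_n^* F_n = n I_{n/4}$ and $\|F_n\|_F^2 = n^2/4$, can be checked numerically. The sketch below makes two simplifying assumptions for illustration: $n$ is taken divisible by 4 (here $n = 16$), and rows are indexed by $j = 0, \ldots, n-1$ instead of $-m, \ldots, m$, which changes neither identity.

```python
import cmath
import math


def partial_dft(n):
    """n x (n/4) matrix with columns a(4l/n), l = 0..n/4-1 (rows j = 0..n-1)."""
    cols = n // 4
    return [[cmath.exp(2j * math.pi * j * 4 * l / n) for l in range(cols)]
            for j in range(n)]


def gram(F):
    """Gram matrix F^* F."""
    rows, cols = len(F), len(F[0])
    return [[sum(F[j][a].conjugate() * F[j][b] for j in range(rows))
             for b in range(cols)] for a in range(cols)]


n = 16
F = partial_dft(n)
G = gram(F)                                            # expected: n times the identity
frob_sq = sum(abs(v) ** 2 for row in F for v in row)   # expected: n^2 / 4
print(round(G[0][0].real, 6), round(abs(G[0][1]), 6), round(frob_sq, 6))
```

The columns are orthogonal because grid frequencies differ by multiples of $4/n$, so each off-diagonal Gram entry is a full geometric sum of $n$th roots of unity.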

5 Proofs of Main Theorems



In this section, there are many numerical constants. Unless otherwise specified, C will denote a 
numerical constant whose value may change from equation to equation. Specific constants will be 
highlighted by accents or subscripts. 

We describe the preliminaries and notation, and restate some recent results that we use, before
sketching the proofs of Theorems 1 and 2.



5.1 Preliminaries 

The sample $x^\star_j$ may be regarded as the $j$th trigonometric moment of the discrete measure $\mu$ given
by (3.1):

\[
  x^\star_j = \int_0^1 e^{i 2\pi j f} \mu(df)
\]

for $-m \le j \le m$. Thus, the problem of extracting the frequencies and amplitudes from noisy
observations may be regarded as the inverse problem of estimating a measure from noisy trigonometric
moments.

We can write the vector $x^\star$ of observations $[x^\star_{-m}, \ldots, x^\star_{m}]^T$ in terms of an atomic decomposition

\[
  x^\star = \sum_{l=1}^{k} c_l a(f_l)
\]

or equivalently in terms of a corresponding representing measure $\mu$ given by (3.1) satisfying

\[
  x^\star = \int_0^1 a(f) \mu(df).
\]

There is a one-to-one correspondence between atomic decompositions and representing measures. Note
that there are infinitely many atomic decompositions of $x^\star$ and also infinitely many corresponding
representing measures. However, since every collection of $n$ atoms is linearly independent, $\mathcal{A}$ forms
a full spark frame [26] and therefore the problem of finding the sparsest decomposition of $x^\star$ is
well-posed if there is a decomposition which is at least $n/2$ sparse.



The atomic norm of a vector $z$ defined in (3.4) is the minimum total variation norm [27, 28]
$\|\mu\|_{TV}$ of all representing measures $\mu$ of $z$. So, minimizing the total variation norm is the same as
finding a decomposition that achieves the atomic norm.



5.2 Dual Certificate and Exact Recovery 

Atomic norm minimization attempts to recover the sparsest decomposition by finding a decomposition
that achieves the atomic norm, i.e., find $c_l, f_l$ such that $x^\star = \sum_l c_l a(f_l)$ and $\|x^\star\|_{\mathcal{A}} = \sum_l |c_l|$, or
equivalently, finding a representing measure $\mu$ of the form (3.1) that minimizes the total variation
norm $\|\mu\|_{TV}$. The authors of [1] showed that when $n > 256$, the decomposition that achieves the
atomic norm is the sparsest decomposition by explicitly constructing a dual certificate [29] of
optimality, whenever the composing frequencies $f_1, \ldots, f_k$ satisfy the minimum separation condition (1.2).
In the rest of the paper, we always make the technical assumption that $n > 256$.

Definition 1 (Dual Certificate). A vector $q \in \mathbb{C}^n$ is called a dual certificate for $x^\star$ if for the
corresponding trigonometric polynomial $Q(f) := \langle q, a(f) \rangle$, we have

\[
  Q(f_l) = \mathrm{sign}(c_l), \quad l = 1, \ldots, k
\]

and

\[
  |Q(f)| < 1
\]

whenever $f \notin \{f_1, \ldots, f_k\}$.

The authors of [1] not only explicitly constructed such a certificate characterized by the dual
polynomial $Q$, but also showed that their construction satisfies some stability conditions, which are
crucial for showing that denoising using the atomic norm provides stable recovery in the presence
of noise.

Theorem 4 (Dual Polynomial Stability, Lemma 2.4 and 2.5 in [2]). For any $f_1, \ldots, f_k$ satisfying
the separation condition (1.2) and any sign vector $v \in \mathbb{C}^k$ with $|v_j| = 1$, there exists a trigonometric
polynomial $Q = \langle q, a(f) \rangle$ for some $q \in \mathbb{C}^n$ with the following properties:

1. For each $j = 1, \ldots, k$, $Q$ interpolates the sign vector $v$ so that $Q(f_j) = v_j$.

2. In each neighborhood $N_j$ corresponding to $f_j$ defined by $N_j = \{f : d(f, f_j) < 0.16/n\}$, the
polynomial $Q(f)$ behaves like a quadratic and there exist constants $C_a, C_a'$ so that

\[
  |Q(f)| \le 1 - \frac{C_a}{2} n^2 (f - f_j)^2 \tag{5.1}
\]

\[
  |Q(f) - v_j| \le \frac{C_a'}{2} n^2 (f - f_j)^2 \tag{5.2}
\]

3. When $f \in F = [0, 1] \setminus \cup_{j=1}^{k} N_j$, there is a numerical constant $C_b > 0$ such that

\[
  |Q(f)| \le 1 - C_b.
\]

We use results in [2] and [7] (reproduced in Appendix D for convenience) and borrow several
ideas from the proofs in [2], with nontrivial modifications, to establish the error rate of atomic norm
regularization.

5.3 Proof of Theorem 1



Let $\hat{\mu}$ be the representing measure for the solution $\hat{x}$ of (3.5) with minimum total variation norm,
that is,

\[
  \hat{x} = \int_0^1 a(f) \hat{\mu}(df)
\]

and $\|\hat{x}\|_{\mathcal{A}} = \|\hat{\mu}\|_{TV}$. Denote the error vector by $e = x^\star - \hat{x}$. Then, the difference measure $\nu = \mu - \hat{\mu}$
is a representing measure for $e$. We first express the denoising error $\|e\|_2^2$ as the integral of the error
function $E(f) = \langle e, a(f) \rangle$ against the difference measure $\nu$:

\begin{align*}
\|e\|_2^2 = \langle e, e \rangle
  &= \left\langle e, \int_0^1 a(f) \nu(df) \right\rangle \\
  &= \int_0^1 \langle e, a(f) \rangle \nu(df) \\
  &= \int_0^1 E(f) \nu(df).
\end{align*}

Using a Taylor series approximation in each of the near regions $N_j$, we first show that the
denoising error (or in general any integral of a trigonometric polynomial against the difference
measure) can be controlled in terms of an integral in the far region $F$ and the zeroth, first, and
second moments of the difference measure in the near regions. The precise result is presented in
the following lemma, whose proof is given in Appendix A.

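Lemma 1. Define

\begin{align*}
I_0^j &:= \left| \int_{N_j} \nu(df) \right| \\
I_1^j &:= n \left| \int_{N_j} (f - f_j) \nu(df) \right| \\
I_2^j &:= \frac{n^2}{2} \int_{N_j} (f - f_j)^2 |\nu|(df) \\
I_l &:= \sum_{j=1}^{k} I_l^j, \quad \text{for } l = 0, 1, 2.
\end{align*}

Then for any $m$th order trigonometric polynomial $X$, we have

\[
  \int_0^1 X(f) \nu(df) \le \|X(f)\|_\infty \left( \int_F |\nu|(df) + I_0 + I_1 + I_2 \right).
\]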



Applying Lemma 1 to the error function, we get

\[
  \|e\|_2^2 \le \|E(f)\|_\infty \left( \int_F |\nu|(df) + I_0 + I_1 + I_2 \right). \tag{5.3}
\]



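As a consequence of our choice of $\tau$ in (3.8), we can show that $\|E(f)\|_\infty \le (1 + 2\eta^{-1})\tau$ with high
probability. In fact, we have

\begin{align*}
\|E(f)\|_\infty &= \sup_{f \in [0,1]} |\langle e, a(f) \rangle| \\
  &= \sup_{f \in [0,1]} |\langle x^\star - \hat{x}, a(f) \rangle| \\
  &\le \sup_{f \in [0,1]} |\langle w, a(f) \rangle| + \sup_{f \in [0,1]} |\langle y - \hat{x}, a(f) \rangle| \\
  &\le \sup_{f \in [0,1]} |\langle w, a(f) \rangle| + \tau \\
  &\le (1 + 2\eta^{-1})\tau \le 3\tau, \quad \text{with high probability.} \tag{5.4}
\end{align*}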






The second inequality follows from the optimality conditions for (3.5). It is shown in Appendix C 
of [7] that the penultimate inequality holds with high probability. 

Therefore, to complete the proof, it suffices to show that the other terms on the right hand side
of (5.3) are $O(k\tau/n)$. While there is no exact frequency recovery in the presence of noise, we can hope
to get the frequencies approximately right. Hence, we expect that the integral in the far region can
be well controlled and the local integrals of the difference measure in the near regions are also small
due to cancellations. Next, we utilize the properties of the dual polynomial in Theorem 4 and
another polynomial given in Theorem 5 in Appendix B to show that the zeroth and first moments of
$\nu$ may be controlled in terms of the other two quantities in (5.3) to upper bound the error rate. The
following lemma is similar to Lemmas 2.2 and 2.3 in [2], but we have made several modifications
to adapt it to our signal and noise model. For completeness, we provide the proof in Appendix C.

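Lemma 2. There exist numerical constants $C_0$ and $C_1$ such that

\begin{align*}
I_0 &\le C_0 \left( \frac{k\tau}{n} + I_2 + \int_F |\nu|(df) \right) \\
I_1 &\le C_1 \left( \frac{k\tau}{n} + I_2 + \int_F |\nu|(df) \right).
\end{align*}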

All that remains to complete the proof is an upper bound on $I_2$ and $\int_F |\nu|(df)$. The key idea
in establishing such a bound is deriving upper and lower bounds on the difference
$\|P_{T^c}(\nu)\|_{TV} - \|P_T(\nu)\|_{TV}$ between the total variation norms of $\nu$ on and off the support. The upper bound can
be derived using optimality conditions. We lower bound $\|P_{T^c}(\nu)\|_{TV} - \|P_T(\nu)\|_{TV}$ using the fact
that a constructed dual certificate $Q$ has unit magnitude for every element in the support $T$ of
$P_T(\nu)$, whence we have $\|P_T(\nu)\|_{TV} = \int_T Q(f) \nu(df)$. A critical element in deriving both the lower
and upper bounds is that the dual polynomial $Q$ has a quadratic drop in each near region $N_j$ and
is bounded away from one in the far region $F$. Finally, by combining these bounds and carefully
controlling the regularization parameter, we get the desired result summarized in the following
lemma. The details of the proof are fairly technical and we leave them to Appendix D.

Lemma 3. Let $\tau = \eta \sigma \sqrt{n \log(n)}$. If $\eta > 1$ is large enough, then there exists a numerical constant
$C$ such that, with high probability,

\[
  \int_F |\nu|(df) + I_2 \le \frac{C k \tau}{n}.
\]

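Putting together Lemmas 1, 2 and 3, we finally prove our main theorem:

\begin{align*}
\frac{1}{n}\|e\|_2^2
  &\le \frac{\|E(f)\|_\infty}{n} \left( \int_F |\nu|(df) + I_0 + I_1 + I_2 \right) \\
  &\le \frac{\|E(f)\|_\infty}{n} \left( \frac{C_1 k \tau}{n} + C_2 \int_F |\nu|(df) + C_3 I_2 \right) \\
  &\le \frac{\|E(f)\|_\infty}{n} \cdot \frac{C k \tau}{n} \\
  &\le \frac{C k \tau^2}{n^2} \\
  &= O\!\left( \sigma^2 \frac{k \log(n)}{n} \right).
\end{align*}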

The first three inequalities come from successive applications of Lemmas 1, 2 and 3 respectively.
The fourth inequality follows from (5.4) and the fifth by our choice of $\tau$ according to Eq. (3.8).
This completes the proof of Theorem 1.
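
The last two steps of this chain can be spelled out as a short calculation. The following is a worked sketch, assuming the bound $\|E(f)\|_\infty \le 3\tau$ from (5.4) and the choice of $\tau$ in (3.8); the constant $C'$ is introduced here only for illustration.

```latex
\begin{align*}
\frac{\|E(f)\|_\infty}{n} \cdot \frac{C k \tau}{n}
  &\le \frac{3\tau}{n} \cdot \frac{C k \tau}{n}
   = \frac{3 C k \tau^2}{n^2} \\
  &= \frac{3 C k \, \eta^2 \sigma^2 \, n \log(n)}{n^2}
   = C' \sigma^2 \, \frac{k \log(n)}{n},
  \qquad C' := 3 C \eta^2 .
\end{align*}
```

The factor of $n$ inside $\tau^2$ cancels one power of $n$ in the denominator, which is what turns the $n^{-2}$ prefactor into the $k \log(n)/n$ rate.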


5.4 Proof of Theorem 2

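The first two statements in Theorem 2 are direct consequences of Lemma 3. For (iii.), we follow [22]
and use the dual polynomial $Q_j^\star(f) = \langle q_j, a(f) \rangle$ constructed in Lemma 2.2 of [22], which satisfies

\begin{align*}
Q_j^\star(f_j) &= 1 \\
|1 - Q_j^\star(f)| &\le n^2 C_1' (f - f_j)^2, \quad f \in N_j \\
|Q_j^\star(f)| &\le n^2 C_1' (f - f_{j'})^2, \quad f \in N_{j'},\ j' \neq j \\
|Q_j^\star(f)| &\le C_2', \quad f \in F.
\end{align*}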

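We note that $c_j - \sum_{l : \hat{f}_l \in N_j} \hat{c}_l = \int_{N_j} \nu(df)$. Then, by applying the triangle inequality several times,

\begin{align*}
\left| \int_{N_j} \nu(df) \right|
  &\le \left| \int_{N_j} Q_j^\star(f) \nu(df) \right| + \left| \int_{N_j} (1 - Q_j^\star(f)) \nu(df) \right| \\
  &\le \left| \int_0^1 Q_j^\star(f) \nu(df) \right| + \left| \int_{N_j^c} Q_j^\star(f) \nu(df) \right| + \left| \int_{N_j} (1 - Q_j^\star(f)) \nu(df) \right| \\
  &\le \left| \int_0^1 Q_j^\star(f) \nu(df) \right| + \int_F |Q_j^\star(f)| |\nu|(df)
     + \sum_{j' \neq j} \int_{N_{j'}} |Q_j^\star(f)| |\nu|(df) + \int_{N_j} |1 - Q_j^\star(f)| |\nu|(df).
\end{align*}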



We upper bound the first term using Lemma 5 in Appendix B, which yields

\[
  \left| \int_0^1 Q_j^\star(f) \nu(df) \right| \le \frac{C k \tau}{n}.
\]



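The other terms can be controlled using the properties of $Q_j^\star$:

\begin{align*}
\int_F |Q_j^\star(f)| |\nu|(df) &\le C_2' \int_F |\nu|(df) \\
\sum_{j' \neq j} \int_{N_{j'}} |Q_j^\star(f)| |\nu|(df) + \int_{N_j} |1 - Q_j^\star(f)| |\nu|(df)
  &\le C_1' \sum_{j'=1}^{k} \int_{N_{j'}} n^2 (f - f_{j'})^2 |\nu|(df) = C_1 I_2.
\end{align*}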



Using Lemma 3, both of the above are upper bounded by $\frac{C k \tau}{n}$. Now, by combining these upper
bounds, we finally have

\[
  \left| c_j - \sum_{l : \hat{f}_l \in N_j} \hat{c}_l \right| \le \frac{C_3 k \tau}{n}.
\]



This shows part (iii) of the theorem. Part (iv) can be obtained by combining parts (ii) and (iii).

6 Experiments



In [7], we demonstrated with extensive experiments that AST outperforms classical subspace
algorithms in terms of mean squared estimation error. In the experiments here, we focus on frequency
localization and compare the performance of AST, MUSIC [8] and Cadzow's method [9] under
various choices of number of frequencies, number of samples and signal-to-noise ratios (SNRs).

We adopt the same experimental setup as in [7] and reproduce the description of the experiments
here for convenience. We generated $k$ normalized frequencies $f_1, \ldots, f_k$ chosen uniformly at random
from $[0, 1]$ such that every pair of frequencies is separated by at least $1/(2n)$. The signal $x^\star \in
\mathbb{C}^n$ is generated according to (1.1) with $k$ random amplitudes independently chosen from the $\chi^2(1)$
distribution (squared Gaussian). All of our sinusoids were then assigned a random phase (equivalent
to multiplying $c_l$ by a random unit-norm complex number). The observation $y$ is produced by adding
complex white Gaussian noise $w$ such that the input signal-to-noise ratio (SNR) is $-10, -5, 0, 5, 10, 15$
or $20$ dB. We compared the average value of the following metrics of the various algorithms in 20
random trials for various values of the number of observations ($n = 64, 128, 256$), and number of
frequencies ($k = n/4, n/8, n/16$).

AST needs an estimate of the noise variance $\sigma^2$ to pick the regularization parameter according
to (3.8). In our experiments, we do not provide our algorithm with the true noise variance. Instead,
we construct an estimate for $\sigma$ with the following heuristic. We formed the empirical
autocorrelation matrix using the MATLAB routine corrmtx with a prediction order $n/3$ and averaged
the lower 25% of the eigenvalues. We then use this estimate in equation (3.8) to determine the
regularization parameter. See [7] for more details.

We implemented AST using the Alternating Direction Method of Multipliers (ADMM; see, for
example, [30], or [7] for the specific details). We used the stopping criteria described in [30] and
set $\rho = 2$ for all experiments. We use the dual solution $\hat{z}$ to determine the support of the optimal
solution $\hat{x}$. Once the frequencies $\hat{f}_l$ are extracted, we solved the least squares problem
$\text{minimize}_{\alpha} \|U\alpha - y\|_2^2$, where $U_{jl} = \exp(i 2\pi j \hat{f}_l)$, to obtain debiased estimates of the amplitudes.
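
The debiasing step is an ordinary complex least squares fit once the frequencies are fixed. The sketch below is one possible way to carry it out with the standard library only, solving the normal equations $U^* U \alpha = U^* y$ by Gaussian elimination; this is an illustrative choice (a QR-based solver would be numerically preferable when estimated frequencies are close together), and the function names and test signal are hypothetical.

```python
import cmath
import math


def design_matrix(freqs, m):
    """U[j][l] = exp(i 2 pi j f_l) for j = -m..m."""
    return [[cmath.exp(2j * math.pi * j * f) for f in freqs]
            for j in range(-m, m + 1)]


def solve_linear(A, b):
    """Solve the square complex system A x = b by Gaussian elimination with pivoting."""
    k = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            factor = M[r][col] / M[col][col]
            for c in range(col, k + 1):
                M[r][c] -= factor * M[col][c]
    x = [0j] * k
    for r in range(k - 1, -1, -1):                     # back substitution
        tail = sum(M[r][c] * x[c] for c in range(r + 1, k))
        x[r] = (M[r][k] - tail) / M[r][r]
    return x


def debias_amplitudes(freqs, y, m):
    """Least-squares amplitudes via the normal equations U^* U alpha = U^* y."""
    U = design_matrix(freqs, m)
    k = len(freqs)
    rows = range(len(U))
    gram = [[sum(U[j][a].conjugate() * U[j][b] for j in rows) for b in range(k)]
            for a in range(k)]
    rhs = [sum(U[j][a].conjugate() * y[j] for j in rows) for a in range(k)]
    return solve_linear(gram, rhs)


m = 10
freqs = [0.12, 0.37]                       # stand-ins for the extracted frequencies
true_amps = [2.0 + 1.0j, -0.5j]
U = design_matrix(freqs, m)
y = [sum(U[j][l] * true_amps[l] for l in range(2)) for j in range(2 * m + 1)]
print(debias_amplitudes(freqs, y, m))
```

With noise-free samples and exact frequencies, as in this toy run, the fit returns the true amplitudes up to rounding error.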



We implemented Cadzow's method as described by the pseudocode in [31], and MUSIC [8] using
the MATLAB routine rootmusic. These algorithms need an estimate of the number of sinusoids.
Rather than implementing a heuristic to estimate $k$, we fed the true $k$ to our solvers. This provides
a significant advantage to these algorithms. On the contrary, AST is not provided the true value
of $k$, and the noise variance $\sigma^2$ required in the regularization parameter is estimated from $y$.

Let $\{\hat{c}_l\}$ and $\{\hat{f}_l\}$ denote the amplitudes and frequencies estimated by any of the algorithms
(AST, MUSIC or Cadzow). We use the following error metrics to characterize the frequency
localization of the various algorithms:

(i) Sum of the absolute value of amplitudes in the far region F, $m_1 = \sum_{l : \hat{f}_l \in F} |\hat{c}_l|$

(ii) The weighted frequency localization error, $m_2 = \sum_{l : \hat{f}_l \in N} |\hat{c}_l| \{\min_{f_j \in T} d(f_j, \hat{f}_l)\}^2$

(iii) Error in approximation of amplitudes in the near region, $m_3 = |c_j - \sum_{l : \hat{f}_l \in N_j} \hat{c}_l|$

These are precisely the quantities that we prove tend to zero in Theorem 2.
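
The three metrics can be computed as in the sketch below. Several details are assumptions made for illustration: the distance d is taken to be the wrap-around distance on [0, 1), the near region $N_j$ is taken to be the set of frequencies within a user-supplied `radius` of $f_j$ (with the true frequencies separated by more than twice that radius), and $m_3$ is returned as one value per true frequency rather than aggregated.

```python
import numpy as np

def wrap_distance(f1, f2):
    """Wrap-around distance between frequencies on the unit interval."""
    d = np.abs(f1 - f2) % 1.0
    return np.minimum(d, 1.0 - d)

def localization_metrics(true_f, true_c, est_f, est_c, radius):
    """Return (m1, m2, m3) for estimated spikes (est_f, est_c).

    radius is the assumed half-width of each near region N_j.
    """
    true_f, true_c = np.asarray(true_f, float), np.asarray(true_c, complex)
    est_f, est_c = np.asarray(est_f, float), np.asarray(est_c, complex)
    dist = wrap_distance(est_f[:, None], true_f[None, :])  # (estimates, truths)
    nearest = dist.argmin(axis=1)       # index of the closest true frequency
    nearest_dist = dist.min(axis=1)
    near = nearest_dist <= radius       # estimate lies in some near region
    m1 = np.sum(np.abs(est_c[~near]))   # amplitude mass in the far region
    m2 = np.sum(np.abs(est_c[near]) * nearest_dist[near] ** 2)
    m3 = np.array([
        abs(true_c[j] - est_c[near & (nearest == j)].sum())
        for j in range(true_f.size)
    ])
    return m1, m2, m3
```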

We first provide performance profiles to summarize the behavior of
the various algorithms across all of the parameter settings. Performance profiles provide a good
visual indicator of the relative performance of many algorithms under a variety of experimental
conditions [32]. Let $\mathcal{P}$ be the set of experiments and let $e_s(p)$ be the value of the error measure e
of experiment $p \in \mathcal{P}$ using the algorithm s. Then the ordinate $P_s(\beta)$ of the graph at $\beta$ specifies
the fraction of experiments where the ratio of the error of the algorithm s to the minimum
error e across all algorithms for the given experiment is at most $\beta$, i.e.,

$$P_s(\beta) = \frac{\#\{p \in \mathcal{P} : e_s(p) \le \beta \min_s e_s(p)\}}{\#(\mathcal{P})}.$$
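
A performance profile can be computed directly from a table of errors. The sketch below assumes the errors are stored in an array `errors[s, p]` (algorithm s, experiment p) and that the best error in every experiment is strictly positive; the function name is hypothetical.

```python
import numpy as np

def performance_profile(errors, betas):
    """Return P[s, b]: fraction of experiments with e_s(p) <= betas[b] * min_s e_s(p)."""
    errors = np.asarray(errors, dtype=float)
    betas = np.asarray(betas, dtype=float)
    best = errors.min(axis=0)   # best error per experiment, over algorithms
    ratios = errors / best      # assumes best > 0 for every experiment
    return (ratios[:, :, None] <= betas[None, None, :]).mean(axis=1)
```

An algorithm whose curve reaches a value close to 1 at small $\beta$ is within a small factor of the best algorithm on almost every experiment.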

Figure 1: Performance profiles for AST, MUSIC and Cadzow. (a) Sum of the absolute value of amplitudes
in the far region, $m_1$. (b) The weighted frequency localization error, $m_2$. (c) Error in approximation of
amplitudes in the near region, $m_3$.

The performance profiles in Figure 1 show that AST is the best performing algorithm for all
three metrics. In fact, AST outperforms MUSIC and Cadzow by a substantial margin for metrics
$m_1$ and $m_2$.

In Figure 2, we display how the error metrics vary with increasing SNR for AST, MUSIC and
Cadzow. We restrict these plots to the experiments with n = 256 samples. These plots demonstrate
that AST localizes frequencies substantially better than MUSIC and Cadzow even at low signal-to-noise
ratios: there is very little energy in the far region of the frequencies ($m_1$), and AST has the smallest
weighted mean square frequency deviation ($m_2$). Although we have plotted the average value in
these plots, we observed spikes in the plots for Cadzow's algorithm, as the average is dominated by
the worst performing instances. These large errors are due to the numerical instability of polynomial
root finding.



7 Conclusion and Future Work 

In this paper, we demonstrated stability of atomic norm regularization by analysis of specific prop- 
erties of the atomic set of moments and the associated dual space of trigonometric polynomials. The 
key to our analysis is the existence and properties of various trigonometric polynomials associated 
with signals with well separated frequencies. 

Though we have made significant progress at understanding the theoretical limits of line-spectral 
estimation and superresolution, our bounds could still be improved. For instance, it remains open
whether the logarithmic term in Theorem 1 can be improved to log(n/k). Deriving such an
upper bound or improving our minimax lower bound would provide an interesting direction for
future work.

Figure 2: For n = 256 samples, the plots from left to right measure the average value over 20
random experiments of the error metrics $m_1$, $m_2$ and $m_3$, respectively. The top, middle and bottom thirds
of the plots represent the subsets of the experiments with the number of frequencies k = 16, 32
and 64, respectively.

Additionally, it is not clear if our localization bounds in Theorem 2 have the optimal dependence
on the number of sinusoids k. For instance, we expect that the condition on signal amplitudes for
approximate support recovery should not depend on k, by comparison with similar guarantees
that have been established for Lasso [33]. We additionally conjecture that for a large enough
regularization parameter, there will be no spurious recovered frequencies in the solution. That is,
there should be no non-zero coefficients in the "far region" F in Theorem 2. Future work should
investigate whether better guarantees on frequency localization are possible.

References 

[1] E. Candès and C. Fernandez-Granda, "Towards a mathematical theory of super-resolution,"
arXiv preprint arXiv:1203.5871, 2012.

[2] E. Candès and C. Fernandez-Granda, "Super-resolution from noisy data," arXiv preprint
arXiv:1211.0290, 2012.

[3] F. Bunea, A. Tsybakov, and M. Wegkamp, "Sparsity oracle inequalities for the Lasso,"
Electronic Journal of Statistics, vol. 1, pp. 169-194, 2007.

[4] E. J. Candès and M. A. Davenport, "How well can we estimate a sparse vector?," Applied and
Computational Harmonic Analysis, 2012.

[5] P. J. Bickel, Y. Ritov, and A. B. Tsybakov, "Simultaneous analysis of Lasso and Dantzig
selector," The Annals of Statistics, vol. 37, no. 4, pp. 1705-1732, 2009.

[6] E. J. Candès and Y. Plan, "Near-ideal model selection by $\ell_1$ minimization," The Annals of
Statistics, vol. 37, no. 5A, pp. 2145-2177, 2009.

[7] B. N. Bhaskar, G. Tang, and B. Recht, "Atomic norm denoising with applications to line
spectral estimation," arXiv preprint arXiv:1204.0562, 2012.

[8] R. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Trans. on
Antennas and Propagation, vol. 34, no. 3, pp. 276-280, 1986.

[9] J. Cadzow, "Spectral estimation: An overdetermined rational model equation approach," Proc.
of the IEEE, vol. 70, no. 9, pp. 907-939, 1982.

[10] R. Roy and T. Kailath, "ESPRIT - estimation of signal parameters via rotational invariance
techniques," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 37, no. 7, pp. 984-995,
1989.

[11] R. Vautard, P. Yiou, and M. Ghil, "Singular-spectrum analysis: A toolkit for short, noisy
chaotic signals," Physica D: Nonlinear Phenomena, vol. 58, no. 1, pp. 95-126, 1992.

[12] R. de Prony, "Essai expérimental et analytique," J. Ec. Polytech. (Paris), vol. 2, pp. 24-76,
1795.

[13] M. Vetterli, P. Marziliano, and T. Blu, "Sampling signals with finite rate of innovation," IEEE
Trans. on Signal Processing, vol. 50, no. 6, pp. 1417-1428, 2002.






[14] P. Stoica and A. Nehorai, "MUSIC, maximum likelihood, and Cramér-Rao bound," IEEE Trans.
on Acoustics, Speech and Signal Processing, vol. 37, no. 5, pp. 720-741, 1989.

[15] P. Stoica and R. Moses, Spectral analysis of signals. Pearson/Prentice Hall, 2005.

[16] D. Malioutov, M. Cetin, and A. Willsky, "A sparse signal reconstruction perspective for source
localization with sensor arrays," IEEE Trans. on Signal Processing, vol. 53, no. 8, pp. 3010-3022,
2005.

[17] S. Bourguignon, H. Carfantan, and J. Idier, "A sparsity-based method for the estimation
of spectral lines from irregularly sampled data," IEEE Journal of Selected Topics in Signal
Processing, vol. 1, no. 4, pp. 575-585, 2007.

[18] R. Baraniuk, V. Cevher, M. Duarte, and C. Hegde, "Model-based compressive sensing," IEEE
Trans. on Information Theory, vol. 56, no. 4, pp. 1982-2001, 2010.

[19] G. Zweig, "Super-resolution Fourier transforms by optimisation, and ISAR imaging," in IEE
Proc. on Radar, Sonar and Navigation, vol. 150, pp. 247-52, IET, 2003.

[20] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, "The convex geometry of linear
inverse problems," Foundations of Computational Mathematics, vol. 12, no. 6, pp. 805-849,
2012.

[21] J.-M. Azaïs, Y. De Castro, and F. Gamboa, "Spike detection from inaccurate samplings,"
arXiv preprint arXiv:1301.5813, 2013.

[22] C. Fernandez-Granda, "Support detection in super-resolution," arXiv preprint
arXiv:1302.3921, 2013.

[23] E. J. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: exact signal
reconstruction from highly incomplete frequency information," IEEE Trans. Inform. Theory, vol. 52,
no. 2, pp. 489-509, 2006.

[24] E. Candès and B. Recht, "Exact matrix completion via convex optimization," Foundations of
Computational Mathematics, vol. 9, no. 6, pp. 717-772, 2009.

[25] G. Raskutti, M. J. Wainwright, and B. Yu, "Minimax rates of estimation for high-dimensional
linear regression over $\ell_q$-balls," IEEE Transactions on Information Theory, vol. 57, no. 10,
pp. 6976-6994, 2011.

[26] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal)
dictionaries via $\ell_1$ minimization," Proceedings of the National Academy of Sciences, vol. 100, no. 5,
pp. 2197-2202, 2003.

[27] G. Tang, B. N. Bhaskar, P. Shah, and B. Recht, "Compressed sensing off the grid," arXiv
preprint arXiv:1207.6053, 2012.

[28] Y. de Castro and F. Gamboa, "Exact reconstruction using Beurling minimal extrapolation,"
Journal of Mathematical Analysis and Applications, vol. 395, no. 1, pp. 336-354, 2012.






[29] E. J. Candès and Y. Plan, "A probabilistic and RIPless theory of compressed sensing," IEEE
Trans. on Information Theory, vol. 57, no. 11, pp. 7235-7254, 2011.

[30] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and
statistical learning via the alternating direction method of multipliers," Foundations and Trends in
Machine Learning, vol. 3, pp. 1-122, December 2011.

[31] T. Blu, P. Dragotti, M. Vetterli, P. Marziliano, and L. Coulot, "Sparse sampling of signal
innovations," Signal Processing Magazine, IEEE, vol. 25, no. 2, pp. 31-40, 2008.

[32] E. Dolan and J. Moré, "Benchmarking optimization software with performance profiles,"
Mathematical Programming, vol. 91, no. 2, pp. 201-213, 2002.

[33] E. J. Candès and Y. Plan, "Near-ideal model selection by $\ell_1$ minimization," The Annals of
Statistics, vol. 37, no. 5A, pp. 2145-2177, 2009.

[34] A. Schaeffer, "Inequalities of A. Markoff and S. Bernstein for polynomials and related functions,"
Bull. Amer. Math. Soc., vol. 47, pp. 565-579, 1941.



A Proof of Lemma 1



We first split the domain of integration into the near and far regions. 

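\begin{align}
\left| \int_0^1 X(f)\,\nu(df) \right|
&\le \left| \int_F X(f)\,\nu(df) \right| + \sum_{j=1}^{k} \left| \int_{N_j} X(f)\,\nu(df) \right| \notag \\
&\le \|X(f)\|_\infty \int_F |\nu|(df) + \sum_{j=1}^{k} \left| \int_{N_j} X(f)\,\nu(df) \right| \tag{A.1}
\end{align}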

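by using Hölder's inequality for the last inequality. Using Taylor's theorem, we may expand the
integrand X(f) around $f_j$ as

$$X(f) = X(f_j) + (f - f_j) X'(f_j) + \frac{1}{2} X''(\xi_j)(f - f_j)^2$$

for some $\xi_j \in N_j$. Thus,

$$\left| X(f) - X(f_j) - X'(f_j)(f - f_j) \right| \le \sup_{\xi \in N_j} \frac{1}{2} |X''(\xi)| (f - f_j)^2 \le \frac{1}{2} n^2 \|X(f)\|_\infty (f - f_j)^2,$$

where for the last inequality we have used a theorem of Bernstein for trigonometric polynomials
(see, for example, [34]):

$$\|X'(f)\|_\infty \le n \|X(f)\|_\infty, \qquad \|X''(f)\|_\infty \le n^2 \|X(f)\|_\infty.$$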
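
As a consequence, we have

\begin{align*}
\left| \int_{N_j} X(f)\,\nu(df) \right|
&\le |X(f_j)| \left| \int_{N_j} \nu(df) \right| + |X'(f_j)| \left| \int_{N_j} (f - f_j)\,\nu(df) \right| + \frac{1}{2} n^2 \|X(f)\|_\infty \int_{N_j} (f - f_j)^2\, |\nu|(df) \\
&\le \|X(f)\|_\infty \left( I_0^j + I_1^j + I_2^j \right).
\end{align*}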

Substituting back into (A.l) yields the desired result. 



B Some useful lemmas 

In addition to Theorem 4, we recall another result in [2], where the authors show the existence of a
trigonometric polynomial $Q_1$ that is linear in each $N_j$, which is also an essential ingredient in our
proof.



Theorem 5 (Lemma 2.7 in [2]). For any $f_1, \ldots, f_k$ satisfying (1.2) and any sign vector $v \in \mathbb{C}^k$ with
$|v_j| = 1$, there exists a polynomial $Q_1 = \langle q_1, a(f) \rangle$ for some $q_1 \in \mathbb{C}^n$ with the following properties:

1. For every $f \in N_j$, there exists a numerical constant $C_a^1$ such that

$$|Q_1(f) - v_j (f - f_j)| \le \frac{n}{2} C_a^1 (f - f_j)^2. \tag{B.1}$$

2. For $f \in F$, there exists a numerical constant $C_b^1$ such that

$$|Q_1(f)| \le \frac{C_b^1}{n}. \tag{B.2}$$



We will also need the following straightforward consequence of the constructions of the
polynomials in Theorem 4, Theorem 5 and Section 5.4.



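Lemma 4. There exists a numerical constant C such that the constructed $Q(f)$ in Theorem 4,
$Q_1(f)$ in Theorem 5 and $Q_j^\star(f)$ in Section 5.4 satisfy respectively

\begin{align}
\|Q(f)\|_1 := \int_0^1 |Q(f)|\,df &\le \frac{Ck}{n} \tag{B.3} \\
\|Q_1(f)\|_1 &\le \frac{Ck}{n^2} \tag{B.4} \\
\|Q_j^\star(f)\|_1 &\le \frac{Ck}{n} \tag{B.5}
\end{align}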



Proof. We will give a detailed proof of (B.3), and list the necessary modifications for proving (B.4)
and (B.5). The dual polynomial Q(f) constructed in [1] is of the form

$$Q(f) = \sum_{f_j \in T} \alpha_j K(f - f_j) + \sum_{f_j \in T} \beta_j K'(f - f_j), \tag{B.6}$$

where K(f) is the squared Fejér kernel (recall that m = (n - 1)/2)

$$K(f) = \left[ \frac{\sin\left( \left(\frac{m}{2} + 1\right) \pi f \right)}{\left(\frac{m}{2} + 1\right) \sin(\pi f)} \right]^4,$$

and for $n \ge 257$, the coefficients $\alpha \in \mathbb{C}^k$ and $\beta \in \mathbb{C}^k$ satisfy [1, Lemma 2.2]

$$\|\alpha\|_\infty \le C_\alpha, \qquad \|\beta\|_\infty \le \frac{C_\beta}{n}$$

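for some numerical constants $C_\alpha$ and $C_\beta$. Using (B.6) and the triangle inequality, we bound $\|Q(f)\|_1$
as follows:

\begin{align}
\|Q(f)\|_1 &= \int_0^1 |Q(f)|\,df \notag \\
&\le k \|\alpha\|_\infty \int_0^1 |K(f)|\,df + k \|\beta\|_\infty \int_0^1 |K'(f)|\,df \tag{B.7} \\
&\le C_\alpha k \int_0^1 |K(f)|\,df + \frac{C_\beta}{n} k \int_0^1 |K'(f)|\,df. \tag{B.8}
\end{align}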

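To continue, note that $\int_0^1 |K(f)|\,df = \int_0^1 |G(f)|^2\,df =: \|G(f)\|_2^2$, where G(f) is the Fejér kernel,
since K(f) is the squared Fejér kernel. We can write

$$G(f) = \left[ \frac{\sin\left( \pi \left(\frac{m}{2} + 1\right) f \right)}{\left(\frac{m}{2} + 1\right) \sin(\pi f)} \right]^2 = \sum_{l=-m/2}^{m/2} g_l e^{-i 2\pi f l}, \tag{B.9}$$

where $g_l = \left(\frac{m}{2} + 1 - |l|\right) / \left(\frac{m}{2} + 1\right)^2$. Now, by using Parseval's identity, we obtain

\begin{align}
\int_0^1 |K(f)|\,df &= \int_0^1 |G(f)|^2\,df = \sum_{l=-m/2}^{m/2} |g_l|^2 \notag \\
&= \frac{1}{\left(\frac{m}{2} + 1\right)^4} \left[ \left(\frac{m}{2} + 1\right)^2 + 2 \sum_{l=1}^{m/2} \left(\frac{m}{2} + 1 - l\right)^2 \right] \notag \\
&\le \frac{C}{n} \tag{B.10}
\end{align}

for some numerical constant C when $n = 2m + 1 \ge 10$.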
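
The bound on the kernel energy can be checked numerically. The sketch below builds the Fejér coefficients $g_l$, confirms Parseval's identity with a Riemann sum (exact here, since $|G|^2$ is a trigonometric polynomial of degree below the grid size), and prints $n \sum_l |g_l|^2$ for a few sizes. The helper names are hypothetical and m is assumed to be even.

```python
import numpy as np

def fejer_coefficients(m):
    """Return indices l and coefficients g_l = (M + 1 - |l|) / (M + 1)^2, M = m/2."""
    M = m // 2  # assumes m is even
    l = np.arange(-M, M + 1)
    return l, (M + 1 - np.abs(l)) / (M + 1) ** 2

def fejer_kernel(f, m):
    """Evaluate G(f) = sum_l g_l exp(-i 2 pi f l) at the points f."""
    l, g = fejer_coefficients(m)
    return np.exp(-2j * np.pi * np.outer(f, l)) @ g

for m in (8, 32, 128, 512):
    n = 2 * m + 1
    _, g = fejer_coefficients(m)
    energy = np.sum(g ** 2)                      # sum of |g_l|^2
    grid = np.arange(4 * n) / (4 * n)            # uniform grid on [0, 1)
    riemann = np.mean(np.abs(fejer_kernel(grid, m)) ** 2)
    print(n, n * energy, abs(riemann - energy))  # n * energy stays bounded
```

In these runs the product $n \sum_l |g_l|^2$ stays below 3, consistent with a bound of the form C/n.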

Now let us turn our attention to $\int_0^1 |K'(f)|\,df$. Since $K(f) = G(f)^2$, we have

$$\int_0^1 |K'(f)|\,df = 2 \int_0^1 |G(f) G'(f)|\,df \le 2 \|G(f)\|_2 \|G'(f)\|_2, \tag{B.11}$$

where the last step is the Cauchy-Schwarz inequality.

We have already established that $\|G(f)\|_2^2 \le C/n$, and we will now show that $\|G'(f)\|_2^2 \le C' n$.



Differentiating the expression for G(f) in (B.9), we get

$$G'(f) = -2\pi i \sum_{l=-m/2}^{m/2} l\, g_l e^{-i 2\pi f l}.$$

Therefore, by applying Parseval's identity again, we get

\begin{align*}
\|G'(f)\|_2^2 &= 4\pi^2 \sum_{l=-m/2}^{m/2} l^2 |g_l|^2 \\
&\le \pi^2 m^2 \sum_{l=-m/2}^{m/2} |g_l|^2 \\
&\le C' n.
\end{align*}

Plugging back into (B.11) yields

$$\int_0^1 |K'(f)|\,df \le C \tag{B.12}$$

for some constant C. Combining (B.12) and (B.10) with (B.8) gives the desired result in (B.3).



The dual polynomial $Q_1(f)$ is also of the form (B.6) with coefficient vectors $\alpha_1$ and $\beta_1$, which
satisfy [2, Proof of Lemma 2.7]

$$\|\alpha_1\|_\infty \le \frac{C_{\alpha_1}}{n}, \qquad \|\beta_1\|_\infty \le \frac{C_{\beta_1}}{n^2}.$$

Combining the above two bounds with (B.7), (B.12) and (B.10) gives the desired result in (B.4).
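The last polynomial $Q_j^\star(f)$ also has the form (B.6) with coefficient vectors $\alpha^\star$ and $\beta^\star$. According
to [22, Proof of Lemma 2.2], these coefficients satisfy

$$\|\alpha^\star\|_\infty \le C_{\alpha^\star}, \qquad \|\beta^\star\|_\infty \le \frac{C_{\beta^\star}}{n},$$

which yields (B.5) following the same argument leading to (B.3). □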



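Using Lemma 4, we can derive the estimates we need in the following lemma.

Lemma 5. Let $\nu = \hat{\mu} - \mu$ be the difference measure. Then, there exists a numerical constant C > 0
such that

\begin{align}
\left| \int_0^1 Q(f)\,\nu(df) \right| &\le \frac{Ck\tau}{n} \tag{B.13} \\
\left| \int_0^1 Q_1(f)\,\nu(df) \right| &\le \frac{Ck\tau}{n^2} \tag{B.14} \\
\left| \int_0^1 Q_j^\star(f)\,\nu(df) \right| &\le \frac{Ck\tau}{n} \tag{B.15}
\end{align}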






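Proof. Let $Q_0 = \langle q_0, a(f) \rangle$ be a general trigonometric polynomial associated with $q_0 \in \mathbb{C}^n$. Then,

\begin{align*}
\left| \int_0^1 Q_0(f)\,\nu(df) \right| &= \left| \int_0^1 \langle q_0, a(f) \rangle\,\nu(df) \right| \\
&= \left| \left\langle q_0, \int_0^1 a(f)\,\nu(df) \right\rangle \right| \\
&= |\langle q_0, e \rangle| \\
&= |\langle Q_0(f), E(f) \rangle| \\
&\le \|Q_0(f)\|_1 \|E(f)\|_\infty.
\end{align*}

Here we use Parseval's identity in the second to last step and Hölder's inequality in the last
inequality. Then, the result follows by using Lemma 4 and (5.4). □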



We also need the following consequence of the optimality condition of AST from [7, Lemma 2]:

Proposition 1.

$$\tau \|\hat{x}\|_{\mathcal{A}} \le \tau \|x^\star\|_{\mathcal{A}} + \langle w, \hat{x} - x^\star \rangle \tag{B.16}$$

C Proof of Lemma 2

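Consider the polar form

$$\int_{N_j} \nu(df) = \left| \int_{N_j} \nu(df) \right| e^{i\theta_j}.$$

Set $v_j = e^{-i\theta_j}$ and let Q(f) be the dual polynomial promised by Theorem 4 for this v. Then, we
have

$$\left| \int_{N_j} \nu(df) \right| = \int_{N_j} e^{-i\theta_j}\,\nu(df) = \int_{N_j} Q(f)\,\nu(df) + \int_{N_j} \left( e^{-i\theta_j} - Q(f) \right) \nu(df).$$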

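Summing over j = 1, ..., k yields

\begin{align}
I_0 &= \sum_{j=1}^{k} \left| \int_{N_j} \nu(df) \right| \notag \\
&= \sum_{j=1}^{k} \int_{N_j} Q(f)\,\nu(df) + \sum_{j=1}^{k} \int_{N_j} \left( e^{-i\theta_j} - Q(f) \right) \nu(df) \notag \\
&\le \left| \int_0^1 Q(f)\,\nu(df) \right| + \int_F |\nu|(df) + C_a' I_2 \notag \\
&\le \frac{Ck\tau}{n} + \int_F |\nu|(df) + C_a' I_2, \tag{C.1}
\end{align}

where the first inequality uses the triangle inequality and (5.2), and the second uses (B.13).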
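
We use a similar argument for bounding $I_1$, but this time use the dual polynomial $Q_1(f)$ guaranteed
by Theorem 5. Again, start with the polar form

$$\int_{N_j} (f - f_j)\,\nu(df) = \left| \int_{N_j} (f - f_j)\,\nu(df) \right| e^{i\theta_j}.$$

Set $v_j = e^{-i\theta_j}$ in Theorem 5 to obtain

\begin{align*}
I_1^j = n \left| \int_{N_j} (f - f_j)\,\nu(df) \right| &= n \int_{N_j} v_j (f - f_j)\,\nu(df) \\
&= n \int_{N_j} \left( v_j (f - f_j) - Q_1(f) \right) \nu(df) + n \int_{N_j} Q_1(f)\,\nu(df).
\end{align*}

Summing over j = 1, ..., k yields

\begin{align}
I_1 &= n \sum_{j=1}^{k} \int_{N_j} \left( v_j (f - f_j) - Q_1(f) \right) \nu(df) + n \sum_{j=1}^{k} \int_{N_j} Q_1(f)\,\nu(df) \notag \\
&\le C_a^1 I_2 + n \left| \int_0^1 Q_1(f)\,\nu(df) \right| + n \left| \int_F Q_1(f)\,\nu(df) \right| \notag \\
&\le C_a^1 I_2 + \frac{Ck\tau}{n} + C_b^1 \int_F |\nu|(df). \tag{C.2}
\end{align}

For the first inequality, we have used (B.1) and the triangle inequality, and for the last inequality, we
have used (B.14) and (B.2). Equations (C.1) and (C.2) complete the proof.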



D Proof of Lemma 3

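Denote by $P_T(\nu)$ the projection of the difference measure $\nu$ on the support set $T = \{f_1, \ldots, f_k\}$
of $x^\star$, so that $P_T(\nu)$ is supported on T. Then, setting Q(f) to be the polynomial in Theorem 4 that
interpolates the sign of $P_T(\nu)$, we have

\begin{align*}
\|P_T(\nu)\|_{TV} &= \int_0^1 Q(f)\, P_T(\nu)(df) \\
&\le \left| \int_0^1 Q(f)\,\nu(df) \right| + \left| \int_{T^c} Q(f)\,\nu(df) \right| \\
&\le \frac{Ck\tau}{n} + \sum_{f_j \in T} \left| \int_{N_j \setminus \{f_j\}} Q(f)\,\nu(df) \right| + \left| \int_F Q(f)\,\nu(df) \right|,
\end{align*}

where for the first inequality we used the triangle inequality and for the last inequality we used (B.13).
The integration over F can be bounded using Hölder's inequality:

$$\left| \int_F Q(f)\,\nu(df) \right| \le (1 - C_b) \int_F |\nu|(df).$$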
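
We continue with

\begin{align*}
\left| \int_{N_j \setminus \{f_j\}} Q(f)\,\nu(df) \right| &\le \int_{N_j \setminus \{f_j\}} |Q(f)|\, |\nu|(df) \\
&\le \int_{N_j \setminus \{f_j\}} \left( 1 - \frac{1}{2} n^2 C_a (f - f_j)^2 \right) |\nu|(df) \\
&\le \int_{N_j \setminus \{f_j\}} |\nu|(df) - C_a I_2^j.
\end{align*}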



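As a consequence, we have

\begin{align*}
\|P_T(\nu)\|_{TV} &\le \frac{Ck\tau}{n} + \sum_{f_j \in T} \int_{N_j \setminus \{f_j\}} |\nu|(df) - C_a I_2 + (1 - C_b) \int_F |\nu|(df) \\
&\le \frac{Ck\tau}{n} + \|P_{T^c}(\nu)\|_{TV} - C_a I_2 - C_b \int_F |\nu|(df),
\end{align*}

or equivalently,

$$\|P_{T^c}(\nu)\|_{TV} - \|P_T(\nu)\|_{TV} \ge C_a I_2 + C_b \int_F |\nu|(df) - \frac{Ck\tau}{n}. \tag{D.1}$$

Now, we appeal to Proposition 1 and obtain

$$\|\hat{x}\|_{\mathcal{A}} \le \|x^\star\|_{\mathcal{A}} + \langle w, e \rangle / \tau,$$

and thus

$$\|\hat{\mu}\|_{TV} \le \|\mu\|_{TV} + |\langle w, e \rangle| / \tau. \tag{D.2}$$

Using Lemma 1,

\begin{align}
|\langle w, e \rangle| &= \left| \left\langle w, \int_0^1 a(f)\,\nu(df) \right\rangle \right| = \left| \int_0^1 \langle w, a(f) \rangle\,\nu(df) \right| \notag \\
&\le \|\langle w, a(f) \rangle\|_\infty \left( \frac{Ck\tau}{n} + I_0 + I_1 + I_2 \right) \tag{D.3} \\
&\le 2\eta^{-1} \tau \left( \frac{Ck\tau}{n} + I_0 + I_1 + I_2 \right) \notag \\
&\le C \eta^{-1} \tau \left( \frac{k\tau}{n} + I_2 + \int_F |\nu|(df) \right) \tag{D.4}
\end{align}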



with high probability, where for the penultimate inequality we used our choice of $\tau$ and $\|\langle w, a(f) \rangle\|_\infty \le
2\eta^{-1}\tau$ with high probability, a fact shown in Appendix C of [7]. Substituting (D.4) in (D.2), we
get

\begin{align*}
\|\mu\|_{TV} + C \eta^{-1} \left( \frac{k\tau}{n} + I_2 + \int_F |\nu|(df) \right) &\ge \|\hat{\mu}\|_{TV} \\
&= \|\mu + \nu\|_{TV} \\
&\ge \|\mu\|_{TV} - \|P_T(\nu)\|_{TV} + \|P_{T^c}(\nu)\|_{TV}.
\end{align*}

Canceling $\|\mu\|_{TV}$ yields

$$\|P_{T^c}(\nu)\|_{TV} - \|P_T(\nu)\|_{TV} \le C \eta^{-1} \left( \frac{k\tau}{n} + I_2 + \int_F |\nu|(df) \right). \tag{D.5}$$

As a consequence of (D.1) and (D.5), we get

$$C (1 + \eta^{-1}) \frac{k\tau}{n} \ge (C_b - \eta^{-1} C) \int_F |\nu|(df) + (C_a - \eta^{-1} C) I_2,$$

whence the result follows for large enough $\eta$.
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